Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

نویسندگان

  • Fei Tao
  • Carlos Busso
چکیده

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging handfree intelligent assistants. Conventional VAD systems relying on audio-only features are normally impaired by noise in the environment. An alternative approach to address this problem is audiovisual VAD (AV-VAD) systems. Modeling timing dependencies between acoustic and visual features is a challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) which combines audiovisual features in a principled, unified framework, capturing the timing dependency within modalities and across modalities. Each modality is modeled with separate bidirectional long short-term memory (BLSTM) networks. The output layers are used as input of another BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., high definition (HD) camera, close-talk microphone). The proposed solution outperforms the audio-only VAD system by 18.5% (absolute) when the conditions are more challenging (i.e., camera and microphone from a tablet with noise in the environment). The proposed approach shows the best performance and robustness across a varieties of conditions, demonstrating its potential for real-world applications.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An End-to-End Architecture for Keyword Spotting and Voice Activity Detection

We propose a single neural network architecture for two tasks: on-line keyword spotting and voice activity detection. We develop novel inference algorithms for an end-to-end Recurrent Neural Network trained with the Connectionist Temporal Classification loss function which allow our model to achieve high accuracy on both keyword spotting and voice activity detection without retraining. In contr...

متن کامل

A deep architecture for audio-visual voice activity detection in the presence of transients

We address the problem of voice activity detection in difficult acoustic environments including high levels of noise and transients, which are common in real life scenarios. We consider a multimodal setting, in which the speech signal is captured by a microphone, and a video camera is pointed at the face of the desired speaker. Accordingly, speech detection translates to the question of how to ...

متن کامل

Benefits for Voice Learning Caused by Concurrent Faces Develop over Time.

Recognition of personally familiar voices benefits from the concurrent presentation of the corresponding speakers' faces. This effect of audiovisual integration is most pronounced for voices combined with dynamic articulating faces. However, it is unclear if learning unfamiliar voices also benefits from audiovisual face-voice integration or, alternatively, is hampered by attentional capture of ...

متن کامل

Non-linear estimation of voice activ recognition of nois

Feed-forward multi-layer perceptrons (MLP) and recurrent neural networks (RNN) fed with different sets of acoustic features are proposed for computing the presence and absence of speech in continuous speech signal in presence of various levels of background noise. Detailed performance evaluations on voice activity detection (VAD) are reported using the Aurora2, Aurora3 and TIMIT corpora. It is ...

متن کامل

A Recurrent Neural Network Model for solving CCR Model in Data Envelopment Analysis

In this paper, we present a recurrent neural network model for solving CCR Model in Data Envelopment Analysis (DEA). The proposed neural network model is derived from an unconstrained minimization problem. In the theoretical aspect, it is shown that the proposed neural network is stable in the sense of Lyapunov and globally convergent to the optimal solution of CCR model. The proposed model has...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017